(54) Method and system for establishing a quorum for a geographically distributed cluster of computers



(57) One embodiment of the present invention pro- 
vides a system that facilitates establishing a quorum for 
a cluster of computers that are geographically distribut- 
ed. The system operates by detecting a change in mem- 
bership of the cluster. Upon detecting the change, the 
system forms a potential new cluster by attempting to 
communicate with all other computers within the cluster.
The system accumulates votes for each computer successfully
contacted. The system also attempts to gain
control of a quorum server located at a site separate 
from all computers within the cluster. If successful at 
gaining control, the system accumulates the quorum 
server's votes as well. If the total of accumulated votes 
is a majority of the available votes, the system forms a 
new cluster from the potential new cluster. 
Description

[0001] The present invention relates to computer 
clusters. More specifically, the present invention relates 
to a method and system for establishing a quorum for a 
geographically distributed computer cluster. 
[0002] Corporate intranets and the Internet are cou- 
pling more and more computers together to provide 
computer users with an ever-widening array of services. 
Many of these services are provided through the client- 
server model in which a client communicates with a 
server to have the server perform an action for the client
or provide data to the client. A server may have to pro- 
vide these services to many clients simultaneously and, 
therefore, must be fast and reliable. 
[0003] In an effort to provide speed and reliability with- 
in servers, designers have developed clustering sys- 
tems for the servers. Clustering systems couple multiple 
computers — also called computing nodes or simply 
nodes — together to function as a single unit. It is desir- 
able for a cluster to continue to function correctly even 
when a node has failed or the communication links be- 
tween the nodes have failed. 

[0004] In order to accomplish this, nodes of a cluster 
typically send "heartbeat" messages to each other reg- 
ularly over private communication links. Failure to re- 
ceive a heartbeat message from a node for a period of 
time indicates that either the node has failed or the
communication links to the node have failed.
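
As a rough illustration of the heartbeat mechanism described above, the following Python sketch records when a heartbeat last arrived from each peer and reports the peers that have been silent for longer than a timeout. The class name `HeartbeatMonitor`, the `timeout` parameter, and the explicit `now` arguments are assumptions made for this example; the description does not specify them.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per peer (hypothetical helper)."""

    def __init__(self, peers, timeout, now=0.0):
        self.timeout = timeout
        # Treat every peer as freshly heard from at start-up.
        self.last_seen = {peer: now for peer in peers}

    def record_heartbeat(self, peer, now):
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def silent_peers(self, now):
        """Peers whose last heartbeat is older than the timeout."""
        return sorted(peer for peer, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(["node_b", "node_c"], timeout=3.0)
monitor.record_heartbeat("node_b", now=4.0)
suspects = monitor.silent_peers(now=5.0)  # node_c has been silent for 5.0 > 3.0
```

A silent peer may have failed, or only the links to it may have failed; the monitor alone cannot tell these cases apart.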
[0005] In the event of a failure, the remaining nodes 
can perform a recovery procedure to allow operations 
to continue without the failed node. By continuing oper- 
ations without the failed node, the cluster provides high- 
er availability. Note that when a failure of a node is de- 
tected, the surviving nodes must come to an agreement 
on the cluster membership. 

[0006] Failures of communication links can cause two 
problems: "split-brain" and "amnesia", which can be 
viewed as a partition in space and a partition in time, 
respectively. The split-brain problem occurs if a commu- 
nication failure partitions the cluster into two (or more) 
functioning sub-groups. Each sub-group will not be able 
to receive heartbeat messages from the nodes in other 
sub-groups. Potentially, each sub-group could decide 
that the nodes in the other sub-group have failed, take 
control of devices normally belonging to the other sub- 
group, and restart any applications that were running on 
the other sub-group. The result is that two different sub- 
groups are trying to control the same devices and run 
the same applications. This can cause data corruption 
if one sub-group overwrites data belonging to the other 
sub-group and application-level corruption because the 
applications in each sub-group are unaware that anoth- 
er copy of the application is running. 
[0007] The amnesia problem occurs if one sub-group 
makes data modifications while the nodes in another 
sub-group have failed. If the cluster is then restarted with 
the failed sub-group running and the formerly operational
sub-group not running, the data modifications can
potentially disappear.

[0008] A standard solution to the split-brain problem 
is to provide a quorum mechanism. Each node in a cluster
is assigned a number of votes. All of the operational
nodes within a sub-group pool their votes and if the sub- 
group has a majority of votes it is permitted to form a 
new cluster and continue operation. For example, in a 
three-node cluster, each node can be given one vote. If 
the cluster is partitioned by a network failure into a two- 
node sub-group and a one-node sub-group, the two- 
node sub-group has two votes and the one-node sub- 
group has one vote. Only the two-node sub-group will 
be permitted to form a new cluster, while the one-node 
sub-group will cease operation. 
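
The three-node example above can be checked with a short Python sketch. The function name `has_quorum` and the node names are assumptions made for illustration; the only rule taken from the text is that a sub-group needs a majority of all assigned votes.

```python
def has_quorum(subgroup_votes, total_votes):
    """True if the sub-group holds a strict majority of all assigned votes."""
    return 2 * subgroup_votes > total_votes


# Three-node cluster, one vote per node, split two against one.
votes = {"node_a": 1, "node_b": 1, "node_c": 1}
total = sum(votes.values())
two_node_side = votes["node_a"] + votes["node_b"]
one_node_side = votes["node_c"]

two_node_side_continues = has_quorum(two_node_side, total)
one_node_side_continues = has_quorum(one_node_side, total)
```

Using a strict majority means that an even split, such as one vote out of two, never yields a quorum on either side.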

[0009] With a two-node cluster, it is desired that either 
node can continue operation if the other node fails. How- 
ever, the quorum mechanism described above does not 
permit either node to function alone. If each node has 
one vote, neither node running alone can achieve a quorum
majority. A majority can be attained if, for example,
one node gets two votes and the other gets one. This
assignment allows the two-vote node to run alone, but
still prevents the one-vote node from running alone.
[0010] A solution to the two-node quorum problem is 
to introduce a quorum device, which can be viewed as 
a vote "tie-breaker." For example, a disk drive, which 
supports small computer system interface (SCSI) res- 
ervations, can be used for a quorum device. The SCSI 
reservation mechanism allows one node to reserve the 
disk drive. The other node can then detect that the disk 
drive has been reserved. In operation, the quorum de- 
vice is assigned an additional vote. If a network failure 
partitions the cluster, both nodes will attempt to reserve 
the SCSI disk. The node that succeeds will obtain the 
additional vote of the quorum device and will have two 
out of three votes and will become the surviving cluster 
member. The other node will have only one vote and 
thus will not become a cluster member. 
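
The tie-breaking behaviour described above can be sketched in Python as follows. The `QuorumDevice` class is an in-memory stand-in written for this example; it is not an implementation of SCSI reservations, and all names in it are hypothetical.

```python
class QuorumDevice:
    """In-memory stand-in for a device that only one node can reserve."""

    def __init__(self, votes=1):
        self.votes = votes
        self.owner = None

    def reserve(self, node):
        """Reserve the device for `node` if it is free; report whether `node` holds it."""
        if self.owner is None:
            self.owner = node
        return self.owner == node


def votes_after_partition(node, own_votes, device):
    """Votes held by a node that can reach only the quorum device."""
    return own_votes + (device.votes if device.reserve(node) else 0)


device = QuorumDevice(votes=1)
total_votes = 3  # one vote per node plus one for the device
a_votes = votes_after_partition("node_a", 1, device)  # reserves the device first
b_votes = votes_after_partition("node_b", 1, device)  # reservation is refused
a_survives = 2 * a_votes > total_votes
b_survives = 2 * b_votes > total_votes
```

Whichever node reserves the device first ends up with two of the three votes; the other node is left with one vote and does not form a cluster.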
[0011] Note that the link from a node to the quorum 
device must be independent of the link between nodes. 
Otherwise, a single link failure could cause failure of 
both inter-node communication and communication 
with the quorum device. In this case, neither node would 
be able to get two votes and the cluster, as a whole, 
would fail. 

[0012] To prevent amnesia, each node keeps a copy 
of state data. When nodes join a cluster, they get up-to- 
date state data from the other nodes in the cluster. By 
requiring a majority of votes, the new cluster will have 
at least one node that was in the previous cluster, there- 
fore ensuring up-to-date state data within the new clus- 
ter. 
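
The reasoning behind this guarantee is that any two groups that each hold a majority of the same set of votes must share at least one voter. The following Python sketch checks this exhaustively for a small cluster; the function name `majorities` and the one-vote-per-node assignment are assumptions made for the example.

```python
from itertools import combinations


def majorities(votes):
    """All node subsets that hold a strict majority of the total votes."""
    total = sum(votes.values())
    nodes = sorted(votes)
    result = []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if 2 * sum(votes[node] for node in subset) > total:
                result.append(set(subset))
    return result


votes = {"node_a": 1, "node_b": 1, "node_c": 1}
groups = majorities(votes)
# Every pair of majority groups has a non-empty intersection.
always_overlap = all(g1 & g2 for g1 in groups for g2 in groups)
```

Because a previous cluster and a new cluster are both majority groups, the shared node can supply up-to-date state data to the new cluster.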

[0013] The previous discussion has assumed that the
nodes of the cluster are located physically near each 
other, so that the nodes can be coupled to each other 
and to the quorum device through separate links. However,
in many cases users wish to have a two-node cluster
with nodes that are widely separated, by potentially
thousands of miles, in order to provide reliability in the 
event of a local disaster. This separation poses prob- 
lems for the quorum configuration. If the quorum device 
is located with either node, a disaster at that site could 
destroy both the node and the quorum device, effective- 
ly preventing the other node from taking control. In ad- 
dition, connecting a quorum device such as a SCSI disk 
over these long distances can be extremely expensive 
or impossible. 

[0014] Accordingly, one embodiment of the present 
invention provides a system that facilitates establishing 
a quorum for a cluster of computers that are geograph- 
ically distributed. The system operates by detecting a 
change in membership of the cluster. Upon detecting the
change, the system forms a potential new cluster by attempting
to communicate with all other computers within
the cluster. The system accumulates votes for each 
computer successfully contacted. The system also at- 
tempts to gain control of a quorum server located at a 
site separate from all computers within the cluster. If 
successful at gaining control, the system accumulates 
the quorum server's vote or votes as well. If the total of 
accumulated votes comprises a majority of the available 
votes, the system forms a new cluster from the potential 
new cluster. 

[0015] In one embodiment of the present invention, 
the system exchanges heartbeat messages with all oth- 
er computers that are part of the cluster. Upon discov- 
ering an absence of heartbeat messages from any com- 
puter in the cluster, the system initiates a cluster mem- 
bership protocol. 

[0016] In one embodiment of the present invention, 
detecting the change in cluster membership includes 
detecting that the cluster has not been formed. 
[0017] In one embodiment of the present invention, 
attempting to gain control of the quorum server involves 
communicating with the quorum server using crypto- 
graphic techniques. 
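
The description does not say which cryptographic techniques are used. As one possible illustration only, the Python sketch below authenticates a control request with an HMAC computed over the node identifier and a nonce. The shared key, the message layout, and the function names are all assumptions made for this example.

```python
import hashlib
import hmac


def sign_request(shared_key: bytes, node_id: str, nonce: str) -> str:
    """Compute an HMAC-SHA256 tag over a control request (assumed message layout)."""
    message = f"{node_id}|{nonce}".encode("utf-8")
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()


def verify_request(shared_key: bytes, node_id: str, nonce: str, tag: str) -> bool:
    """Check a tag in constant time, as the quorum server might do."""
    expected = sign_request(shared_key, node_id, nonce)
    return hmac.compare_digest(expected, tag)


key = b"example-shared-secret"  # hypothetical key known to the node and the server
tag = sign_request(key, "computer_102", "nonce-1")
accepted = verify_request(key, "computer_102", "nonce-1", tag)
forged = verify_request(key, "computer_104", "nonce-1", tag)
```

A real deployment would also need key distribution and replay protection, which this sketch does not attempt to model.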

[0018] In one embodiment of the present invention, 
the system exchanges a status message with each 
member of the new cluster. The system updates the lo- 
cal status of the computer to the most recent status 
available within the status messages. 
[0019] Another embodiment of the present invention 
provides a system that facilitates establishing a quorum 
for a cluster of computers that are geographically dis- 
tributed. The system provides a quorum server at a site 
separate from a location of any computer within the cluster.
The system assigns at least one vote to each computer
within the cluster. The system also assigns at least
one vote to the quorum server. In operation, the system 
attempts to establish communications between each 
pair of computers within the cluster. A count of votes is 
accumulated at each computer for each computer that 
responds. The system also attempts to establish control 
over the quorum server from each computer within the 
cluster. If control is established over the quorum server,
the quorum server's vote(s) are accumulated in the
count of votes. The system establishes a quorum when 
a majority of available votes has been accumulated in 
the count of votes. 
[0020] In one embodiment of the present invention,
the quorum server grants control to only a first computer
attempting to establish control. Another approach is for
the quorum server to grant control to only one computer
out of all the computers attempting to establish control
based on a pre-established priority list.
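
The two grant policies mentioned above can be contrasted with a small Python sketch. Representing each request as an `(arrival_time, node)` pair, and the function names themselves, are assumptions made for this example.

```python
def grant_first_come(requests):
    """Grant control to the earliest request; each request is (arrival_time, node)."""
    return min(requests)[1] if requests else None


def grant_by_priority(requests, priority_list):
    """Grant control to the requester that appears earliest in priority_list."""
    requesters = {node for _, node in requests}
    for node in priority_list:
        if node in requesters:
            return node
    return None


requests = [(0.8, "computer_104"), (0.3, "computer_102")]
first = grant_first_come(requests)
preferred = grant_by_priority(requests, ["computer_104", "computer_102"])
```

With the same pair of requests, the first-come policy favours the computer whose request arrived first, while the priority policy favours the computer listed first regardless of arrival order.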

[0021] In one embodiment of the present invention, 
votes are assigned so that the quorum includes at least 
one computer that was in an immediately previous clus- 
ter. This ensures that a cluster formed from the quorum 
has current data.

[0022] In one embodiment of the present invention, 
attempting to establish control over the quorum server 
involves establishing communications with the quorum 
server. Note that cryptographic techniques may be employed
here to deter attacks.

[0023] Various embodiments in accordance with the 
invention will now be described in detail by way of ex- 
ample only, with reference to the following drawings: 

FIG. 1 illustrates a geographically distributed cluster
of computers coupled together in accordance
with one embodiment of the present invention.

FIG. 2 is a flowchart illustrating the process of detecting
and processing a failure within a cluster in
accordance with one embodiment of the present invention.

FIG. 3 is a flowchart illustrating the process of determining
cluster membership, such as may be
used in the process of FIG. 2.

FIG. 4 is a flowchart illustrating the process of granting
control of a quorum server such as shown in
FIG. 1.

FIG. 5 is a flowchart illustrating the process of
reconfiguring a computer within a cluster, such as
may be used in the process of FIG. 2.

Computer Cluster 

[0024] FIG. 1 illustrates a geographically distributed
cluster of computers coupled together in accordance
with one embodiment of the present invention. Computers
102 and 104 form a cluster of computers that operate
in concert to provide services and data to users. Two or
more computers are formed into a cluster to provide
speed and reliability for the users. Computers 102 and
104 are located in geographic areas 120 and 122, respectively.
Geographic areas 120 and 122 are widely
separated, possibly by thousands of miles, in order to
provide survivability for the cluster in case of a local
disaster at geographic area 120 or 122. For example,
geographic area 120 may be located in California, while
geographic area 122 may be located in New York.
[0025] Computers 102 and 104 can generally include
any type of computer system, including, but not limited
to, a computer system based on a microprocessor, a 
mainframe computer, a digital signal processor, a port- 
able computing device, a personal organizer, a device 
controller, and a computational engine within an appli- 
ance. 

[0026] Computers 102 and 104 communicate across
private network 108. Private network 108 may include
at least two independent links of communication between
computers 102 and 104 to provide redundancy
to allow uninterrupted communications in case one of
the links fails. Private network 108 can generally include
any type of wired or wireless communication channel 
capable of coupling together computing nodes. This in- 
cludes, but is not limited to, a local area network, a wide 
area network, or a combination of networks. 
[0027] Computers 102 and 104 are also coupled to 
public network 110 to allow communication with users. 
Public network 110 can generally include any type of 
wired or wireless communication channel capable of 
coupling together computing nodes. This includes, but 
is not limited to, a local area network, a wide area net- 
work, or a combination of networks. In one embodiment 
of the present invention, public network 110 includes the
Internet. 

[0028] Quorum server 106 provides quorum services
to computers 102 and 104. Additionally, quorum server 
106 can provide quorum services to other clusters of 
computers independent of computers 102 and 104. 
Quorum server 106 is located in geographic area 124, 
which is separate from geographic areas 120 and 122. 
For example, geographic area 124 may be located in 
Illinois. 

[0029] Quorum server 106 can generally include any
type of computer system, including, but not limited to, a 
computer system based on a microprocessor, a main- 
frame computer, a digital signal processor, a portable 
computing device, a personal organizer, a device con- 
troller, a computational engine within an appliance, and 
a cluster of computers. There may also be multiple quo- 
rum servers at different sites. 

[0030] Computers 102 and 104 communicate with 
quorum server 106 across communication links 116 and
118, respectively. Communication links 116 and 118 can
be low bandwidth communication links such as dial-up
modem connections. Communication links 116 and 118
are typically used only during configuration or re-config- 
uration of the cluster. These links may also be the same 
network as public network 110, e.g., the Internet. 

Cluster Failures 

[0031] FIG. 2 is a flowchart illustrating the process of 
detecting and processing a failure within a cluster in ac- 
cordance with one embodiment of the present invention. 
The system starts when a computer, say computer 102,
exchanges a heartbeat message with every other node
in the cluster (step 202). Next, computer 102 checks for
a failure to receive heartbeats via one of the links on
private network 108 (step 204). If there are no link failures,
the process returns to 202 to repeat exchanging
heartbeat messages.
[0032] If computer 102 detects a failure in the links to
another node on private network 108, computer 102 determines
if all links to the other node have failed to provide
heartbeats (step 206). If not all links to the other node
have failed, the process returns to 202 to repeat exchanging
heartbeat messages. Otherwise, either all
links have failed, or the other node has failed.
[0033] If computer 102 detects that all links to the other
node have failed to provide heartbeats, computer 102
attempts to exchange messages with other communicating
nodes to initiate a cluster membership protocol
(step 208). The surviving nodes then co-operate to determine
membership for a new cluster (step 210). Details
of determining membership for the new cluster are
described below in conjunction with FIG. 3.
[0034] After determining cluster membership, computer
102 determines if computer 102 was excluded from
membership (step 212). If computer 102 was excluded
from membership, computer 102 shuts down (step 214).
Otherwise, computer 102 reconfigures (step 216). Details
of how computer 102 reconfigures are described
below in conjunction with FIG. 5. The reconfiguration algorithm
ensures that each computer reaches consistent
membership decisions; therefore, each computer will either
be part of the new cluster or will shut down.
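
The decisions of steps 202 to 216 can be summarised, for a single peer, in a short Python sketch. The function name, the representation of link state as a list of booleans, and the `determine_membership` callback are assumptions made for this example; the membership protocol itself is only stubbed.

```python
def handle_heartbeat_round(link_ok, determine_membership, self_id):
    """One pass through the FIG. 2 flow for a single peer (illustrative only).

    link_ok: one boolean per private link to the peer (True = heartbeat seen).
    determine_membership: callable returning the set of members of the new cluster.
    """
    if all(link_ok):
        return "continue"             # step 204: no link failures
    if any(link_ok):
        return "continue"             # step 206: at least one link still works
    members = determine_membership()  # steps 208 and 210
    if self_id not in members:
        return "shut down"            # steps 212 and 214
    return "reconfigure"              # step 216


healthy = handle_heartbeat_round([True, True], lambda: set(), "computer_102")
degraded = handle_heartbeat_round([False, True], lambda: set(), "computer_102")
kept = handle_heartbeat_round([False, False], lambda: {"computer_102"}, "computer_102")
dropped = handle_heartbeat_round([False, False], lambda: {"computer_104"}, "computer_102")
```

The membership protocol runs only when every link to the peer is silent, which is why redundant links on the private network reduce unnecessary reconfigurations.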


Determining Cluster Membership 

[0035] FIG. 3 is a flowchart illustrating the process of
determining cluster membership in accordance with one
embodiment of the present invention. The system starts
when a computer, for example computer 102, attempts
to take control of quorum server 106 (step 302). Whether
successful or not, computer 102 accumulates votes
from all other computers contacted plus, if computer 102
successfully took control of quorum server 106, the vote(s)
of quorum server 106 (step 304).
[0036] Next, computer 102 informs all other nodes
how many votes have been attained (step 306). Computer
102 then determines if the group has captured the
majority of votes (step 308). If the majority of votes have
not been captured, computer 102 determines if it was
part of the previous cluster (step 310). If computer 102
was part of the previous cluster, the process returns to
step 304 and continues to accumulate votes; otherwise,
computer 102 shuts down (step 312).
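
The vote counting and the decision of steps 304 to 312 can be sketched in Python as follows. The function name and its parameters are hypothetical, and counting the computer's own vote together with the votes of the computers it contacted is an assumption of this sketch.

```python
def membership_decision(own_votes, contacted_votes, quorum_server_votes,
                        won_quorum_server, total_votes, was_in_previous_cluster):
    """Outcome of one pass through steps 304 to 312 of FIG. 3 (illustrative only)."""
    accumulated = own_votes + sum(contacted_votes)
    if won_quorum_server:
        accumulated += quorum_server_votes
    if 2 * accumulated > total_votes:
        return "form cluster"   # majority captured at step 308
    if was_in_previous_cluster:
        return "retry"          # return to step 304
    return "shut down"          # step 312


# Two-node cluster with a one-vote quorum server: three votes in total.
winner = membership_decision(1, [], 1, True, 3, True)
loser_retrying = membership_decision(1, [], 1, False, 3, True)
newcomer = membership_decision(1, [], 1, False, 3, False)
both_nodes = membership_decision(1, [1], 1, False, 3, True)
```

In the two-node case, an isolated computer forms a cluster only if it also controls the quorum server, while two computers that can still reach each other hold a majority without it.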

[0037] If a majority of votes have been captured at
308, computer 102 determines a fully connected set of
the responding computers (step 314). Computer 102 uses
well-known graphing techniques to determine a fully
connected set of responding computers; therefore, this
process will not be discussed further.
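
The description does not name a particular technique for step 314. One simple possibility, shown here purely as an illustration, is a brute-force search for the largest group of computers in which every pair can communicate. The function name, the representation of working links as a set of `frozenset` pairs, and the tie-breaking by sorted order are assumptions of this sketch.

```python
from itertools import combinations


def fully_connected_set(nodes, links):
    """Largest subset of nodes in which every pair can communicate.

    links: set of frozensets {a, b}, one per working pair.
    Brute force, which is adequate for the small node counts of a cluster.
    """
    nodes = sorted(nodes)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(frozenset(pair) in links for pair in combinations(subset, 2)):
                return set(subset)
    return set()


nodes = ["a", "b", "c", "d"]
links = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}
members = fully_connected_set(nodes, links)  # "d" reaches only "c", so it is left out
```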
[0038] After a fully connected set of computers has
been determined, computer 102 informs the other
nodes of the membership of the new cluster (step 316).
Note that the above steps are being accomplished by
all computers in the system simultaneously.

Controlling Quorum Server 

[0039] FIG. 4 is a flowchart illustrating the process of 
granting control of quorum server 106 in accordance 
with one embodiment of the present invention. The sys- 
tem starts when quorum server 106 receives a request 
for control from a node in the proposed new cluster (step 
402). Next, quorum server 106 determines if the re- 
questing node was on the list of nodes for the previous 
cluster (step 404). If the requesting node was not on the 
list of nodes for the previous cluster, quorum server 106
determines if the list of nodes for the previous cluster is 
empty (step 406). Note that an empty list indicates that 
a cluster had never been formed and this request is part 
of initializing a cluster for the first time. If the cluster list 
is not empty at 406, quorum server 106 denies the re- 
quest to control quorum server 106 (step 408). 
[0040] If the node was on the previous cluster list at 
404 or if the cluster list is empty at 406, quorum server 
106 sets the cluster list to contain only the requesting 
node (step 410). (It will be apparent to a person of ordinary
skill in the art that there are other ways to reset the
list, including receiving a list of nodes from the requesting
node to include in the list, or receiving a list of nodes
from the requesting node to exclude from the list.) Finally,
quorum server 106 affirms the request to control quorum
server 106 and grants its vote(s) to the requesting
node (step 412).
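
The grant logic of steps 402 to 412 can be sketched as a small Python class. The class and method names are assumptions made for this example, and returning zero votes is simply one way to represent a denied request.

```python
class QuorumServer:
    """Illustrative model of the grant logic in FIG. 4."""

    def __init__(self, votes=1):
        self.votes = votes
        self.cluster_list = set()  # empty: no cluster has been formed yet

    def request_control(self, node):
        """Return the votes granted to `node`, or 0 if the request is denied."""
        on_list = node in self.cluster_list        # step 404
        if not on_list and self.cluster_list:      # step 406: list is not empty
            return 0                               # step 408: deny
        self.cluster_list = {node}                 # step 410
        return self.votes                          # step 412: grant vote(s)


server = QuorumServer(votes=1)
first = server.request_control("computer_102")   # empty list, so control is granted
second = server.request_control("computer_104")  # not on the list, so it is denied
repeat = server.request_control("computer_102")  # still on the list, granted again
```

Resetting the list to the requesting node at step 410 is what causes a later request from the other side of a partition to be denied.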

Reconfiguring a Computer 

[0041] FIG. 5 is a flowchart illustrating the process of 
reconfiguring a computer within a cluster in accordance 
with one embodiment of the present invention. The sys- 
tem starts when a computer, say computer 102, re- 
ceives status data from other nodes in the new cluster 
(step 502). Next, computer 102 determines which set of
status data is the most recent (step 504).
[0042] Computer 102 updates its own internal status
to conform with the most recent status data available
(step 506). Finally, computer 102 informs quorum server
106 which nodes to include in the new cluster list (step 
508). 
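
Steps 502 to 506 can be illustrated with a brief Python sketch. How recency is measured is not specified in the description; the `version` counter used here, the dictionary layout, and the function name are assumptions made for this example. Step 508, informing the quorum server, is not modelled.

```python
def reconfigure(local_status, received_statuses):
    """Adopt the most recent status among the local copy and those received.

    Each status is a dict with a 'version' counter (an assumed recency measure).
    """
    candidates = [local_status] + list(received_statuses)
    return max(candidates, key=lambda status: status["version"])


local = {"version": 7, "data": "stale"}
received = [{"version": 9, "data": "current"}, {"version": 8, "data": "older"}]
updated = reconfigure(local, received)
```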

[0043] The data structures and code described herein 
for implementing the establishment of a quorum are typ- 
ically stored on a computer readable storage medium, 
which may be any device or medium that can store code 
and/or data for use by a computer system. This includes, 
but is not limited to, magnetic and optical storage devic- 
es such as disk drives, magnetic tape, CDs (compact 
discs) and DVDs (digital versatile discs or digital video 
discs), and computer instruction signals embodied in a 
transmission medium (with or without a carrier wave upon
which the signals are modulated). For example, the
transmission medium may include a communications
network, such as the Internet.

[0044] The foregoing description of various embodiments
of the present invention has been provided in the
context of a particular application, and for the purpose
of illustration only. Many other modifications and variations
will be apparent to practitioners skilled in the art,
and so the scope of the present invention is not limited
to the particular embodiments shown, but rather is defined
by the appended claims and equivalents thereof.



Claims 

1. A method for facilitating the establishment of a quorum
for a cluster within a plurality of computers that
are geographically distributed, the method comprising
the steps of:

detecting a change in membership of the cluster
at a computer within the plurality of computers; and

upon detecting the change in membership,

forming a potential new cluster by attempting
to communicate with all other computers within
the plurality of computers,

accumulating votes for each computer
successfully contacted,

attempting to gain control of a quorum server
located at a site separate from all computers
within the plurality of computers,

if successful, accumulating the quorum server's
votes, and

if the total of accumulated votes represents a
majority of the available votes, forming a new
cluster from the potential new cluster.

2. The method of claim 1, wherein the step of detecting
a change in membership includes the steps of:

exchanging heartbeat messages with all computers
that are part of the cluster; and
upon discovering an absence of a heartbeat
message from any computer in the cluster, initiating
a cluster membership protocol.

3. The method of claim 1, wherein the step of detecting
the change in cluster membership includes detecting
that the cluster has not been formed.

4. The method of any preceding claim, wherein the 
step of attempting to gain control of the quorum 
server includes communicating with the quorum 
server using cryptographic techniques. 


5. The method of any preceding claim, further comprising
the steps of:

exchanging a status message with each member
of the new cluster; and
updating the local status at the computer to the
most recent status available within the status
message.

6. A method to facilitate establishing a quorum for a 
cluster within a plurality of computers that are geo- 
graphically distributed, the method comprising the 
steps of: 

providing a quorum server at a site separate 
from the location of a computer within the plu- 
rality of computers; 

assigning at least one vote to each computer 
within the plurality of computers; 
assigning at least one vote to the quorum serv- 
er; 

attempting to establish communications be- 
tween each pair of computers within the plural- 
ity of computers; 

accumulating a count of votes for each compu- 
ter communicated with at each computer; 
attempting to establish control over the quorum 
server from each computer within the plurality 
of computers; 

if control is established over the quorum server,
accumulating the quorum server's vote(s) in the 
count of votes; and 

establishing the quorum when a majority of the 
available votes has been accumulated in the 
count of votes. 

7. The method of claim 6, wherein the quorum server 
grants control to only a first computer attempting to 
establish control. 

8. The method of claim 6, wherein the quorum server 
grants control to only one computer of all computers 
attempting to establish control based on a pre-es- 
tablished priority list. 

9. The method of any of claims 6 to 8, wherein votes 
are assigned so that the quorum includes at least 
one computer that was in an immediately previous 
cluster, to ensure that a cluster formed from the quorum
has current data.

10. The method of any of claims 6 to 9, wherein the step 
of attempting to establish control over the quorum 
server involves establishing communications with 
the quorum server. 

11. The method of claim 10, wherein the step of establishing
communications with the quorum server involves
using cryptographic techniques.

12. A computer program comprising instructions that
when executed by a computer cause the computer
to perform a method according to any preceding
claim.

13. A computer-readable storage medium storing instructions
that when executed by a computer cause
the computer to perform a method to facilitate establishing
a quorum for a cluster within a plurality of
computers that are geographically distributed, the
method comprising:

detecting a change in membership of the cluster
at a computer within the plurality of computers; and

upon detecting the change in membership,

forming a potential new cluster by attempting
to communicate with all other computers
within the plurality of computers,

accumulating votes for each computer
successfully contacted,

attempting to gain control of a quorum
server located at a site separate from all computers
within the plurality of computers,

if successful, accumulating the quorum
server's votes, and

if the total of accumulated votes represents
a majority of the available votes, forming
a new cluster from the potential new cluster.

14. A computer-readable storage medium storing instructions
that when executed by a computer cause
the computer to perform a method to facilitate establishing
a quorum for a cluster within a plurality of
computers that are geographically distributed, the
method comprising:

providing a quorum server at a site separate
from the location of a computer within the plurality
of computers;

assigning at least one vote to each computer
within the plurality of computers;

assigning at least one vote to the quorum server;

attempting to establish communications between
each pair of computers within the plurality
of computers;

accumulating a count of votes for each computer
communicated with at each computer;

attempting to establish control over the quorum
server from each computer within the plurality
of computers;

if control is established over the quorum server,
accumulating the quorum server's vote(s) in the
count of votes; and

establishing the quorum when a majority of the
available votes has been accumulated in the
count of votes.
15. A system to facilitate establishing a quorum for a
cluster within a plurality of computers that are geographically
distributed, wherein the plurality of computers
are coupled together by a network, the system
comprising:

a quorum server located at a site separate from
any one computer of the plurality of computers;
and

an independent communications link for coupling
each computer of the plurality of computers
to the quorum server.

16. The system of claim 15, wherein the quorum server
includes a mechanism for granting control to only
one computer of the plurality of computers requesting
control.

17. The system of claim 15, wherein the quorum server
includes a mechanism for maintaining a list of computers
accepted into the cluster.

18. The system of any of claims 15 to 17, wherein the
quorum server includes a mechanism for cryptographically
ensuring an identity of a computer attempting
to establish control.

19. The system of any of claims 15 to 18, wherein the
quorum server includes monitoring means to monitor
the status of each computer within the plurality
of computers.